Quantitative and statistical methods can be applied to virtually any dataset for a myriad of reasons. Being well versed in these skills is vitally important for a successful data scientist. Moreover, knowing how to correctly apply methods to a dataset ensures that any statistical findings or predictions will be as useful as possible. The purpose of this case study is to apply the methods covered in this course to a mock business scenario.
This case study presents a business problem: given a set of data, a full analysis will be performed and a statistical model will be developed to generate predictions. The penalties imposed for incorrect predictions vary, with a more severe penalty for an incorrectly predicted positive. A false positive results in a loss of $225, while a false negative results in a loss of $35. The model will therefore be built with the intent of minimizing the total amount lost. The data are composed of 50 feature variables and a binomial target variable, with 0 indicating a “negative” and 1 indicating a “positive”.
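Given these penalties, the quantity to minimize can be stated directly. A minimal Python sketch of the loss structure assumed throughout this study (the counts passed in are hypothetical):

```python
# Penalty structure from the case study:
# a false positive costs $225, a false negative costs $35.
FP_COST = 225
FN_COST = 35

def total_loss(n_false_pos, n_false_neg):
    """Dollar loss incurred by a batch of predictions."""
    return n_false_pos * FP_COST + n_false_neg * FN_COST

# A false positive is roughly 6.4x as costly as a false negative,
# so the model should be biased toward avoiding false positives.
print(total_loss(10, 10))  # 2600
```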
The dataset is composed of 45 numeric and 5 categorical features with nondescript attribute names and no associated metadata. After a brief inspection of the variable names and types, the missing values were visualized to assess data quality.
Fig 3.1 Missing variable plot
As represented in figure 3.1, there were some missing data points in all of the non-target features, though not many. Further investigation of the data was needed before deciding how to handle these missing values.
Of the few character attributes in the dataset, some are easy to interpret: there is a day-of-the-week column, a month column, something that alludes to a dollar amount, a percentage, and a location.
Fig 3.2 Density Plots of numeric variables
Initially, the dollar and percentage symbols were removed from the features which contained them so that these variables could be treated as continuous numeric. Inspection of the density plots in figure 3.2 showed that nearly all of these variables are normally distributed. The feature which was initially the percentage column, however, falls into distinct bins; for this reason, that variable was left as it was originally represented in the dataset, as a categorical feature.
Once the data was in the correct format, even the categorical variables showed a roughly symmetric, bell-shaped frequency distribution; for example, the most observations occur on Wednesday and the fewest on Monday and Friday. Because of this, the missing values in the dataset were replaced with values chosen at random from the rest of each column (Appendix 6.2.1).
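The same random-draw imputation (done in R in Appendix 6.2.1) can be sketched in Python; the column values below are hypothetical:

```python
import random

random.seed(42)  # for reproducibility

def impute_random(col):
    """Replace missing entries with random draws from the observed values."""
    observed = [v for v in col if v is not None]
    return [v if v is not None else random.choice(observed) for v in col]

days = ["Mon", None, "Wed", "Wed", None, "Fri"]
filled = impute_random(days)
# every None is now one of the observed day names,
# so the column's empirical distribution is roughly preserved
```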
Fig 3.3 Correlation matrix of numeric variables
The next step was to evaluate the data for any associations that could be leveraged to the advantage of the model. Using the correlation matrix shown in figure 3.3, the variables which correlate with one another were identified.
Some combinations of variables were plotted to see whether there were any relationships in the data which could be leveraged. The most compelling of these plots are shown in the subsequent figures.
Fig 3.4
Figure 3.4 shows the scatter plot between the x36 and x40 variables; there appears to be some separation of the target along the x40 axis. Though not a perfect delineation, there seem to be more positives with x40 greater than 0 and more negatives below zero.
Fig 3.5
Figure 3.5 shows a similar pattern with the x23 variable; it looks somewhat as though there are more positives with x23 < 0.
Fig 3.6
Combining these two features on the same scatter plot reinforces the notions presented above. Figure 3.6 shows that a majority of the positives fall in the quadrant with x40 > 0 and x23 < 0; this knowledge can be exploited later to our advantage.
Preliminary model testing involved a few separate methods, but ultimately the XGBoost tree-based classifier was chosen for performance reasons. This type of model builds decision trees over a number of boosting rounds: each round, the error of the current ensemble is calculated and a new tree is fit to correct it, producing progressively more accurate predictions. Due to the nature of this model, creating new features from the data may help it make better classifications. After careful inspection of the data, several new features were created.
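The boosting idea described above can be illustrated with a toy sketch, simplified from XGBoost's full trees to single-split regression "stumps" fit in plain Python; each round fits a stump to the current residuals and the ensemble's error shrinks:

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x < t else rmean

def boost(xs, ys, rounds=20, eta=0.5):
    """Additively combine stumps, each fit to the previous residuals."""
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + eta * stump(x) for p, x in zip(pred, xs)]
    return pred

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
pred = boost(xs, ys)
sse = sum((y - p) ** 2 for y, p in zip(ys, pred))  # shrinks toward 0
```

Here `eta` plays the same damping role as XGBoost's learning rate: each stump's contribution is scaled down so later rounds can keep correcting the residuals.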
One feature created pertains to the x37 column, the attribute which initially held a dollar sign. These values were normally distributed with a zero mean, and it was assumed that they corresponded to some monetary value. The data was binned by assigning one of the following tags (“high_neg”, “low_neg”, “mid”, “low_pos”, “high_pos”) depending on the amount in column x37 (Appendix 6.2.2). It is possible that some information regarding the relative amount of money gained or lost was discarded by this transformation.
Another pair of features pertains to findings from the EDA section. The variables x23 and x40 may offer some separation in the data which allows for better classification, so indicator columns were created to flag whether x23/x40 is positive or negative (Appendix 6.2.3). Though this information is obviously already present in the data, the tree method used may still benefit.
Finally, it was also assumed that the feature which contains (essentially) the names of months pertains to months of the year. This was exploited to create a “quarter” column indicating the quarter of the year in which each observation occurs (Appendix 6.2.4).
With the dataset prepared, statistical modeling can commence. As previously stated, preliminary investigation revealed that an XGBoost tree-based classifier yielded the best performance metrics on a sample of the data. To accomplish this task, the caret package in R was utilized. The XGBoost algorithm requires numeric variables, so the dataset needed some pre-processing: categorical variables were one-hot encoded, and, though not required for XGBoost, the numeric data was centered and scaled. This numeric transformation yielded a slight performance boost in initial models, so it was maintained throughout (Appendix 6.2.5).
Once the data has been processed, it can be fed into the model. To tune the model, a 5-fold grid-search cross-validation was performed to find the best parameters, using a custom summary function (Appendix 6.2.6). The challenge is to minimize the number of false positives, so the XGBoost model was optimized to maximize precision on the held-out data. This should ensure that the final model limits the number of false positives.
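The metrics driving this optimization can be stated concretely. A minimal Python sketch, mirroring the intent of the custom summary function in Appendix 6.2.6, with the positive class taken as 1 for illustration:

```python
def error_metrics(pred, obs):
    """Precision plus Type I / Type II error rates over all predictions."""
    tp = sum(p == 1 and o == 1 for p, o in zip(pred, obs))
    fp = sum(p == 1 and o == 0 for p, o in zip(pred, obs))
    fn = sum(p == 0 and o == 1 for p, o in zip(pred, obs))
    n = len(pred)
    return {"precision": tp / (tp + fp),  # maximized during tuning
            "type_I_err": fp / n,         # false positives as a share of all predictions
            "type_II_err": fn / n}        # false negatives as a share of all predictions

m = error_metrics(pred=[1, 1, 0, 0, 1], obs=[1, 0, 0, 1, 1])
# precision = 2/3, type_I_err = 0.2, type_II_err = 0.2
```

Maximizing precision directly suppresses false positives, which carry the $225 penalty, at the possible expense of more $35 false negatives.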
Parameter tuning was accomplished with the caret package, as previously stated, by testing different values of the algorithm's hyperparameters: nrounds controls the number of boosting iterations the algorithm will attempt, max_depth controls the depth of the decision trees created, eta controls the rate at which the model learns from the data to optimize its predictions, and gamma controls how conservatively the algorithm splits. Multiple combinations were tried in a relatively exhaustive search for optimal parameters on the entire dataset (Appendix 6.2.7).
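Because the grid in Appendix 6.2.7 crosses every value of each parameter, the number of candidate models grows multiplicatively; a quick Python check of the search size:

```python
from itertools import product

# values taken from the tuning grid in Appendix 6.2.7
nrounds   = [150, 200, 250, 300]
max_depth = [6, 12, 18]
eta       = [0.2, 0.25, 0.3, 0.35]
gamma     = [1, 10]

grid = list(product(nrounds, max_depth, eta, gamma))
print(len(grid))  # 96 candidate settings, each fit 5 times under 5-fold CV
```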
The final model was saved as an .rds file and included in the submitted data folder for inspection, if desired.
Full results of model tuning can be found in the appendix (Appendix 6.2.8). Table 4.1 below shows the parameters of the best tune from cross-validation.
Table 4.1: Best XGBoost Tune
| Parameter | Value |
|---|---|
| nrounds | 150 |
| max_depth | 18 |
| eta | 0.2 |
| gamma | 1 |
Fig 4.1 Cross-validation results
Figure 4.1 above shows the precision of the model on the testing data during cross-validation. Increasing the number of boosting iterations did little to improve performance, which is why the smallest value in the grid was selected.
As expected, the greater value of gamma resulted in a more conservative algorithm, which sacrificed model performance. Though the model still did quite well at the higher value of gamma, it was held back from reaching optimal levels.
Clearly, the greatest contribution to model performance came from increasing the maximum depth of the trees. Increasing the maximum depth past 18 was considered, but doing so could lead to over-fitting of the data; the default depth for this algorithm is 6, so it seemed risky to increase it any further.
The eta, or learning rate, performed comparably well at every value tested.
The best model performed well on the held-out data during cross-validation. As the model summary shows, the data was split into 5 folds: in each fold the model was trained on roughly 128,000 samples and tested on the remaining 32,000. Minimum-loss calculations for the objective of this case study were derived from the metrics collected during cross-validation. The best tune achieved a mean precision of 94.7% on the held-out data.
Finally, to assess the amount lost by this model, the Type I and Type II error rates were calculated during cross-validation. These metrics were averaged over the 5 folds, each with a testing set of 32,000 samples. The final model produced a Type I error rate, or false positive percentage, of 3.22% and a Type II error rate, or false negative percentage, of 2.43%. With this in mind, the loss from 32,000 test samples totaled roughly $258,807.
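The quoted figure can be reproduced from the cross-validated error rates of the best tune (Type I 0.03216249, Type II 0.02431874, from the resampling table in Appendix 6.2.8):

```python
n_test = 32000                             # held-out samples per fold
type_I, type_II = 0.03216249, 0.02431874   # best tune's mean error rates

# expected counts of each error type times its dollar penalty
loss = n_test * type_I * 225 + n_test * type_II * 35
print(round(loss))  # 258807
```

Note that false positives, though only about 1.3x as frequent in dollar-free terms, account for nearly 90% of the total loss because of the 225-to-35 penalty ratio.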
Statistical methods can be applied in many different ways to optimize processes. This case study demonstrated a comprehensive analysis of a dataset, complete with building a classifier to generate predictions. Any information garnered from these methods must be viewed with perspective: it can be alluring to see a great metric and believe that it will translate to the real world. Ideally, models would perform as exceptionally in production as they do in testing and development, but this is not always the case. It is up to the data scientist to set reasonable expectations for statistical models and to understand that they are merely tools which require correct understanding and implementation to be effective.
# Copy the data and replace each column's missing values with values
# sampled at random from that column's observed values
data2 <- data
data2[] <- lapply(data2, function(x) replace(x, is.na(x), sample(x[!is.na(x)], sum(is.na(x)))))
data_imp <- cbind(data2, target)
data <- data_imp
# Bin x37 into five monetary ranges
breaks <- c(-4735, -1600, -550, 550, 1600, 5550)
tags <- c("high_neg", "low_neg", "mid", "low_pos", "high_pos")
group_tags <- cut(data$x37,
                  breaks = breaks,
                  include.lowest = TRUE,
                  right = FALSE,
                  labels = tags)
v <- data %>% select(x37)
vgroup <- as_tibble(v) %>%
  mutate(tag = case_when(
    x37 < -1600 ~ tags[1],
    x37 >= -1600 & x37 < -550 ~ tags[2],
    x37 >= -550 & x37 < 550 ~ tags[3],
    x37 >= 550 & x37 < 1600 ~ tags[4],
    x37 >= 1600 ~ tags[5],
    is.na(x37) ~ "none"))
# Flag whether x23 and x40 are positive or negative
x23_tag <- c("neg", "pos")
x23_v <- data %>% select(x23)
x23_vgroup <- as_tibble(x23_v) %>%
  mutate(x23_tag = case_when(
    x23 < 0 ~ x23_tag[1],
    x23 >= 0 ~ x23_tag[2]))
x40_v <- data %>% select(x40)
x40_vgroup <- as_tibble(x40_v) %>%
  mutate(x40_tag = case_when(
    x40 < 0 ~ x23_tag[1],
    x40 >= 0 ~ x23_tag[2]))
first_quarter = c("January", "Feb", "Mar")
second_quarter = c("Apr", "May", "Jun")
third_quarter = c("July", "Aug", "sept.")
fourth_quarter = c("Oct", "Nov", "Dev")
quarter_col <- ifelse(data$x29 %in% first_quarter, "first_qtr",
               ifelse(data$x29 %in% second_quarter, "second_qtr",
               ifelse(data$x29 %in% third_quarter, "third_qtr",
               ifelse(data$x29 %in% fourth_quarter, "fourth_qtr",
                      NA))))
ohe_cols <- names(data %>% select_if(is.factor))
numeric_cols <- names(data %>% select_if(is.numeric) %>% select(-y))
dummies <- dummyVars(~ ., data = data3[, ohe_cols], sep = "_")
data_ohe <- predict(dummies, newdata = data3[, ohe_cols])
preprocessor <- preProcess(data3[, numeric_cols], method = c("center", "scale"))
data_scl <- predict(preprocessor, data3[, numeric_cols])
data_fin <- cbind(data_ohe, data_scl, target)
# strip characters from column names that would upset the model interface
names(data_fin) <- gsub("%", "", names(data_fin), fixed = TRUE)
names(data_fin) <- gsub("-", "neg", names(data_fin), fixed = TRUE)
caret Optimization Function
f1 <- function(data, lev = NULL, model = NULL) {
  f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  p <- Precision(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  r <- Recall(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  fp <- sum(data$pred == 0 & data$obs == 1) / length(data$pred)
  fn <- sum(data$pred == 1 & data$obs == 0) / length(data$pred)
  c(F1 = f1_val,
    prec = p,
    rec = r,
    Type_I_err = fp,
    Type_II_err = fn)
}
fitControl <- trainControl(method = "cv",
number = 5,
verbose = T,
summaryFunction = f1)
xgb_grid <- expand.grid(nrounds = c(150, 200, 250, 300),
max_depth = c(6, 12, 18),
eta = c(0.2, 0.25, 0.3, 0.35),
gamma=c(1, 10),
colsample_bytree=1,
min_child_weight=1,
subsample=1)
set.seed(42)
xgb_fit <- train(as.factor(y) ~ ., data = data_fin,
method = 'xgbTree',
trControl = fitControl,
verbose = T,
tuneGrid = xgb_grid,
metric = "prec",
maximize = T)
## eXtreme Gradient Boosting
##
## 160000 samples
## 91 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 128000, 128000, 128000, 128001, 127999
## Resampling results across tuning parameters:
##
## eta max_depth gamma nrounds F1 prec rec Type_I_err
## 0.20 6 1 150 0.9366348 0.9286198 0.9447930 0.04348749
## 0.20 6 1 200 0.9392325 0.9320336 0.9465466 0.04133123
## 0.20 6 1 250 0.9409647 0.9344444 0.9475800 0.03980623
## 0.20 6 1 300 0.9415439 0.9350284 0.9481541 0.03944998
## 0.20 6 10 150 0.9279803 0.9173861 0.9388328 0.05063124
## 0.20 6 10 200 0.9279803 0.9173861 0.9388328 0.05063124
## 0.20 6 10 250 0.9279803 0.9173861 0.9388328 0.05063124
## 0.20 6 10 300 0.9279803 0.9173861 0.9388328 0.05063124
## 0.20 12 1 150 0.9527639 0.9466134 0.9589992 0.03238749
## 0.20 12 1 200 0.9530171 0.9468779 0.9592393 0.03222499
## 0.20 12 1 250 0.9530220 0.9468876 0.9592393 0.03221874
## 0.20 12 1 300 0.9530220 0.9468876 0.9592393 0.03221874
## 0.20 12 10 150 0.9448410 0.9372203 0.9525902 0.03821249
## 0.20 12 10 200 0.9448410 0.9372203 0.9525902 0.03821249
## 0.20 12 10 250 0.9448410 0.9372203 0.9525902 0.03821249
## 0.20 12 10 300 0.9448410 0.9372203 0.9525902 0.03821249
## 0.20 18 1 150 0.9531437 0.9469873 0.9593854 0.03216249
## 0.20 18 1 200 0.9531437 0.9469873 0.9593854 0.03216249
## 0.20 18 1 250 0.9531437 0.9469873 0.9593854 0.03216249
## 0.20 18 1 300 0.9531437 0.9469873 0.9593854 0.03216249
## 0.20 18 10 150 0.9476366 0.9404074 0.9549805 0.03623750
## 0.20 18 10 200 0.9476366 0.9404074 0.9549805 0.03623750
## 0.20 18 10 250 0.9476366 0.9404074 0.9549805 0.03623750
## 0.20 18 10 300 0.9476366 0.9404074 0.9549805 0.03623750
## 0.25 6 1 150 0.9384258 0.9310741 0.9458994 0.04193125
## 0.25 6 1 200 0.9407438 0.9342923 0.9472877 0.03989375
## 0.25 6 1 250 0.9418834 0.9356167 0.9482376 0.03907498
## 0.25 6 1 300 0.9425364 0.9363350 0.9488221 0.03863123
## 0.25 6 10 150 0.9277930 0.9172689 0.9385719 0.05069999
## 0.25 6 10 200 0.9277930 0.9172689 0.9385719 0.05069999
## 0.25 6 10 250 0.9277930 0.9172689 0.9385719 0.05069999
## 0.25 6 10 300 0.9277930 0.9172689 0.9385719 0.05069999
## 0.25 12 1 150 0.9523839 0.9468121 0.9580285 0.03223122
## 0.25 12 1 200 0.9523124 0.9467215 0.9579763 0.03228748
## 0.25 12 1 250 0.9523124 0.9467215 0.9579763 0.03228748
## 0.25 12 1 300 0.9523124 0.9467215 0.9579763 0.03228748
## 0.25 12 10 150 0.9443111 0.9376811 0.9510454 0.03785625
## 0.25 12 10 200 0.9443111 0.9376811 0.9510454 0.03785625
## 0.25 12 10 250 0.9443111 0.9376811 0.9510454 0.03785625
## 0.25 12 10 300 0.9443111 0.9376811 0.9510454 0.03785625
## 0.25 18 1 150 0.9521832 0.9457515 0.9587070 0.03293124
## 0.25 18 1 200 0.9521832 0.9457515 0.9587070 0.03293124
## 0.25 18 1 250 0.9521832 0.9457515 0.9587070 0.03293124
## 0.25 18 1 300 0.9521832 0.9457515 0.9587070 0.03293124
## 0.25 18 10 150 0.9471999 0.9402767 0.9542290 0.03629374
## 0.25 18 10 200 0.9471999 0.9402767 0.9542290 0.03629374
## 0.25 18 10 250 0.9471999 0.9402767 0.9542290 0.03629374
## 0.25 18 10 300 0.9471999 0.9402767 0.9542290 0.03629374
## 0.30 6 1 150 0.9390651 0.9325234 0.9457011 0.04097499
## 0.30 6 1 200 0.9413195 0.9351756 0.9475486 0.03933124
## 0.30 6 1 250 0.9435722 0.9380461 0.9491665 0.03753749
## 0.30 6 1 300 0.9441409 0.9384778 0.9498763 0.03728748
## 0.30 6 10 150 0.9285870 0.9188193 0.9385719 0.04966250
## 0.30 6 10 200 0.9285870 0.9188193 0.9385719 0.04966250
## 0.30 6 10 250 0.9285870 0.9188193 0.9385719 0.04966250
## 0.30 6 10 300 0.9285870 0.9188193 0.9385719 0.04966250
## 0.30 12 1 150 0.9517952 0.9457702 0.9579032 0.03289374
## 0.30 12 1 200 0.9517952 0.9457702 0.9579032 0.03289374
## 0.30 12 1 250 0.9517952 0.9457702 0.9579032 0.03289374
## 0.30 12 1 300 0.9517952 0.9457702 0.9579032 0.03289374
## 0.30 12 10 150 0.9444198 0.9375733 0.9513690 0.03793125
## 0.30 12 10 200 0.9444198 0.9375733 0.9513690 0.03793125
## 0.30 12 10 250 0.9444198 0.9375733 0.9513690 0.03793125
## 0.30 12 10 300 0.9444198 0.9375733 0.9513690 0.03793125
## 0.30 18 1 150 0.9518367 0.9452809 0.9584878 0.03322498
## 0.30 18 1 200 0.9518367 0.9452809 0.9584878 0.03322498
## 0.30 18 1 250 0.9518367 0.9452809 0.9584878 0.03322498
## 0.30 18 1 300 0.9518367 0.9452809 0.9584878 0.03322498
## 0.30 18 10 150 0.9454286 0.9386909 0.9522667 0.03724375
## 0.30 18 10 200 0.9454286 0.9386909 0.9522667 0.03724375
## 0.30 18 10 250 0.9454286 0.9386909 0.9522667 0.03724375
## 0.30 18 10 300 0.9454286 0.9386909 0.9522667 0.03724375
## 0.35 6 1 150 0.9406066 0.9343464 0.9469537 0.03984373
## 0.35 6 1 200 0.9423928 0.9362350 0.9486342 0.03868748
## 0.35 6 1 250 0.9430329 0.9374183 0.9487177 0.03792499
## 0.35 6 1 300 0.9434397 0.9380602 0.9488847 0.03751873
## 0.35 6 10 150 0.9273459 0.9183409 0.9365469 0.04988749
## 0.35 6 10 200 0.9273459 0.9183409 0.9365469 0.04988749
## 0.35 6 10 250 0.9273459 0.9183409 0.9365469 0.04988749
## 0.35 6 10 300 0.9273459 0.9183409 0.9365469 0.04988749
## 0.35 12 1 150 0.9503826 0.9445695 0.9562749 0.03360624
## 0.35 12 1 200 0.9503826 0.9445695 0.9562749 0.03360624
## 0.35 12 1 250 0.9503826 0.9445695 0.9562749 0.03360624
## 0.35 12 1 300 0.9503826 0.9445695 0.9562749 0.03360624
## 0.35 12 10 150 0.9432392 0.9361525 0.9504400 0.03881874
## 0.35 12 10 200 0.9432392 0.9361525 0.9504400 0.03881874
## 0.35 12 10 250 0.9432392 0.9361525 0.9504400 0.03881874
## 0.35 12 10 300 0.9432392 0.9361525 0.9504400 0.03881874
## 0.35 18 1 150 0.9510953 0.9448431 0.9574335 0.03346874
## 0.35 18 1 200 0.9510953 0.9448431 0.9574335 0.03346874
## 0.35 18 1 250 0.9510953 0.9448431 0.9574335 0.03346874
## 0.35 18 1 300 0.9510953 0.9448431 0.9574335 0.03346874
## 0.35 18 10 150 0.9454644 0.9390641 0.9519535 0.03698749
## 0.35 18 10 200 0.9454644 0.9390641 0.9519535 0.03698749
## 0.35 18 10 250 0.9454644 0.9390641 0.9519535 0.03698749
## 0.35 18 10 300 0.9454644 0.9390641 0.9519535 0.03698749
## Type_II_err
## 0.03305624
## 0.03200624
## 0.03138749
## 0.03104374
## 0.03662499
## 0.03662499
## 0.03662499
## 0.03662499
## 0.02455000
## 0.02440625
## 0.02440625
## 0.02440625
## 0.02838749
## 0.02838749
## 0.02838749
## 0.02838749
## 0.02431874
## 0.02431874
## 0.02431874
## 0.02431874
## 0.02695626
## 0.02695626
## 0.02695626
## 0.02695626
## 0.03239374
## 0.03156249
## 0.03099374
## 0.03064374
## 0.03678124
## 0.03678124
## 0.03678124
## 0.03678124
## 0.02513124
## 0.02516249
## 0.02516249
## 0.02516249
## 0.02931250
## 0.02931250
## 0.02931250
## 0.02931250
## 0.02472499
## 0.02472499
## 0.02472499
## 0.02472499
## 0.02740625
## 0.02740625
## 0.02740625
## 0.02740625
## 0.03251249
## 0.03140625
## 0.03043750
## 0.03001249
## 0.03678125
## 0.03678125
## 0.03678125
## 0.03678125
## 0.02520624
## 0.02520624
## 0.02520624
## 0.02520624
## 0.02911875
## 0.02911875
## 0.02911875
## 0.02911875
## 0.02485625
## 0.02485625
## 0.02485625
## 0.02485625
## 0.02858125
## 0.02858125
## 0.02858125
## 0.02858125
## 0.03176249
## 0.03075625
## 0.03070625
## 0.03060625
## 0.03799375
## 0.03799375
## 0.03799375
## 0.03799375
## 0.02618124
## 0.02618124
## 0.02618124
## 0.02618124
## 0.02967500
## 0.02967500
## 0.02967500
## 0.02967500
## 0.02548749
## 0.02548749
## 0.02548749
## 0.02548749
## 0.02876874
## 0.02876874
## 0.02876874
## 0.02876874
##
## Tuning parameter 'colsample_bytree' was held constant at a value of 1
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
##
## Tuning parameter 'subsample' was held constant at a value of 1
## prec was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 18, eta
## = 0.2, gamma = 1, colsample_bytree = 1, min_child_weight = 1 and subsample = 1.
rf_grid <- expand.grid(mtry = c(20, 40, 78))
nrow(rf_grid)
set.seed(42)
rf_fit <- train(as.factor(y) ~ ., data = training,
method = 'rf',
trControl = fitControl,
verbose = T,
## Now specify the exact models
## to evaluate:
tuneGrid = rf_grid,
metric = "prec",
maximize = T)
rf_fit
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5,
                           verbose = T)
xgb_grid <- expand.grid(nrounds = c(50, 100, 200),
max_depth = c(2, 6, 10),
eta = c(0.1, 0.3, 0.5),
gamma=1,
colsample_bytree=1,
min_child_weight=1,
subsample=1)
nrow(xgb_grid)
set.seed(42)
xgb_fit <- train(as.factor(y) ~ ., data = training,
method = 'xgbTree',
trControl = fitControl,
verbose = T,
## Now specify the exact models
## to evaluate:
tuneGrid = xgb_grid)
xgb_fit
sklearn
import pandas as pd
import numpy as np
import xgboost as xgb
#from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.metrics import classification_report, confusion_matrix
data=r.data2.copy()
target = data.y
data = data.drop('y', axis=1)
data_ohe = pd.get_dummies(data, drop_first=True)
numeric_features = list(data.select_dtypes(include="float64").columns)
numeric_transformer = Pipeline(steps=[
#('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = list(data.select_dtypes(exclude="float64").columns)
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
xgb_model = xgb.XGBClassifier(use_label_encoder=False, verbosity=2)
clf_XG_pipe = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', xgb_model)])
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=0)
clf_XG_pipe.fit(X_train, y_train)
print("model score: %.3f" % clf_XG_pipe.score(X_test, y_test))
data=r.data2.copy()
target = data.y
data = data.drop('y', axis=1)
data_ohe = pd.get_dummies(data, drop_first=True)
#clf_ada = AdaBoostClassifier(n_estimators=100)
xgb_model = xgb.XGBClassifier(use_label_encoder=False)
numeric_features = list(data.select_dtypes(include="float64").columns)
numeric_transformer = Pipeline(steps=[
#('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = list(data.select_dtypes(exclude="float64").columns)
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf_XG_pipe = Pipeline(steps=[('preprocessor', preprocessor)])
X_train, X_test, y_train, y_test = train_test_split(data_ohe, target, test_size=0.2, random_state=0)
#X_train_xg = clf_XG_pipe.fit_transform(X_train)
#X_test_xg = clf_XG_pipe.fit_transform(X_test)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {'use_label_encoder':False, 'verbosity':2, 'objective':'binary:logistic'}
bst = xgb.train(dtrain=dtrain, params=param)
preds = bst.predict(dtest)
a = np.where(preds > 0.5, 1, 0)
labels = dtest.get_label()
print('error=%f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))